[DataFrame] Implement Indexer getitem by simon-mo · Pull Request #1955 · ray-project/ray

simon-mo · 2018-04-27T02:07:05Z

What do these changes do?

Implement the __getitem__ methods of the loc and iloc indexers.

Note

Tests passed in private repo.

AmplabJenkins · 2018-04-27T03:10:53Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5078/

devin-petersohn

This is really great! I left some comments and questions.

devin-petersohn · 2018-04-29T20:09:52Z

python/ray/dataframe/dataframe.py

                Metadata for the new dataframe's columns
+            partial (boolean):
+                Internal: row_metadata and col_metadata only covers part of the
+                block partitions. (Used in index 'vew' accessor)


vew -> view.

devin-petersohn · 2018-04-29T20:10:20Z

python/ray/dataframe/dataframe.py

                 copy=False, col_partitions=None, row_partitions=None,
-                 block_partitions=None, row_metadata=None, col_metadata=None):
+                 block_partitions=None, row_metadata=None, col_metadata=None,
+                 partial=False):


Would it make more sense to have a DataFrameView object as a subclass of this one?

As I'm reading through this, I think it might make more sense to have it as a subclass. Changing the way that _block_partitions is handled in the case that it's a view makes complicated code more complicated.

Alternatively, if we decide otherwise, I would still like to see more comments about the _block_partitions changes.

I agree we should have it as a subclass. A DataFrameView object makes sense.

I'm refactoring parts of my indexing.py to move the enlargement logic into the parent class of _LocIndexer and _iLocIndexer, so that we have cleaner code in indexing.py.
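For illustration, the subclass idea discussed above could look roughly like the sketch below. This is not the code from this PR: the toy DataFrame class, the list-of-labels metadata, and the DataFrameView constructor signature are all assumptions made to keep the example self-contained. The point is only that the view shares the parent's block partitions while carrying metadata for a subset of them, so the base class needs no partial flag.

```python
class DataFrame:
    """Toy stand-in for the real class: a grid of blocks plus label metadata."""

    def __init__(self, block_partitions, row_metadata, col_metadata):
        self._block_partitions = block_partitions
        self._row_metadata = list(row_metadata)  # assumed: plain list of row labels
        self._col_metadata = list(col_metadata)  # assumed: plain list of column labels

    @property
    def shape(self):
        return (len(self._row_metadata), len(self._col_metadata))


class DataFrameView(DataFrame):
    """Hypothetical view whose metadata covers only part of the parent's blocks."""

    def __init__(self, parent, row_labels, col_labels):
        wanted_rows = set(row_labels)
        wanted_cols = set(col_labels)
        super().__init__(
            parent._block_partitions,  # shared with the parent, not copied
            [r for r in parent._row_metadata if r in wanted_rows],
            [c for c in parent._col_metadata if c in wanted_cols],
        )
        self._parent = parent


parent = DataFrame([[1, 2], [3, 4], [5, 6]], ["r0", "r1", "r2"], ["a", "b"])
view = DataFrameView(parent, ["r0", "r2"], ["b"])
```

With this split, any view-specific handling of _block_partitions would live in DataFrameView rather than in branches inside the base class.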

devin-petersohn · 2018-04-29T20:12:03Z

python/ray/dataframe/dataframe.py

            if block_partitions is not None:
                # put in numpy array here to make accesses easier since it's 2D
                self._block_partitions = np.array(block_partitions)
+                if row_metadata is not None:


Why do we need this in two places?

devin-petersohn · 2018-04-29T20:19:23Z

python/ray/dataframe/dataframe.py

        Returns:
            The dtypes for this DataFrame.
        """
+        # Deal with empty column case


I'd prefer the comment to say "Deal with a DataFrame with no columns" or something along those lines. "Empty column" feels a bit ambiguous (it could mean all NaN in some cases).

devin-petersohn · 2018-04-29T20:20:25Z

python/ray/dataframe/dataframe.py

+                Internal: row_metadata and col_metadata only covers part of the
+                block partitions. (Used in index 'vew' accessor)
        """
+        self.partial = partial


partial -> _partial

devin-petersohn · 2018-04-29T20:28:34Z

python/ray/dataframe/indexing.py

+        if is_2d(row_loc) and is_2d(col_loc):
+            return self._get_dataframe_view(row_loc, col_loc)
+
+    def _get_scaler(self, row_loc, col_loc):


Are you planning to implement these? It would be better to have a NotImplementedError than just pass. That way we don't have some silent internal error or something.
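To make the concern concrete, here is a minimal sketch contrasting the two kinds of stub. The class names are hypothetical and the method bodies are not taken from the PR; only the method name _get_scaler comes from the diff.

```python
class SilentIndexer:
    def _get_scaler(self, row_loc, col_loc):
        # A bare `pass` makes the method return None, which a caller
        # cannot distinguish from a legitimate (if odd) result.
        pass


class LoudIndexer:
    def _get_scaler(self, row_loc, col_loc):
        # Fails immediately and names the missing piece.
        raise NotImplementedError("_get_scaler is not implemented yet")


silent_result = SilentIndexer()._get_scaler(0, 0)

try:
    LoudIndexer()._get_scaler(0, 0)
    raised = False
except NotImplementedError:
    raised = True
```

The silent version would let a None propagate into later code, where the eventual failure would be far from its cause.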

devin-petersohn · 2018-04-29T20:54:17Z

python/ray/dataframe/dataframe.py

            axis = 0
            columns = pd_df.columns
            index = pd_df.index
+            self._row_metadata = self._col_metadata = None


Duplicated (Remove this in favor of Line 73)

devin-petersohn · 2018-04-29T20:57:28Z

python/ray/dataframe/indexing.py

-        retrieved_rows_remote = self._map_partition(
-            lookup_dict, col_label, indexer='loc')
-        joined_df = pd.concat(ray.get(retrieved_rows_remote))
+    def _get_scaler(self, row_loc, col_loc):


Would you be able to add some comments about what these methods do (this and methods below)? Just so that we can quickly look at the code and maintain it better.
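As a rough idea of the level of documentation being asked for, the sketch below adds docstrings to a simplified dispatcher. The names is_2d, _get_dataframe_view, and _get_scaler appear in the diff, but their bodies here are guesses: in particular, this is_2d treats slices and lists as multi-element selections, which may not match the PR. The _get_series name is hypothetical, and the handlers return string tags in place of real results.

```python
def is_2d(loc):
    """Assumed meaning: True if `loc` can select more than one label."""
    return isinstance(loc, (slice, list))


class IndexerSketch:
    """Simplified stand-in for an indexer; handlers return tags, not data."""

    def __getitem__(self, key):
        """Route a (row_loc, col_loc) pair to the handler for its result shape."""
        row_loc, col_loc = key
        if is_2d(row_loc) and is_2d(col_loc):
            return self._get_dataframe_view(row_loc, col_loc)
        if is_2d(row_loc) or is_2d(col_loc):
            return self._get_series(row_loc, col_loc)
        return self._get_scaler(row_loc, col_loc)

    def _get_dataframe_view(self, row_loc, col_loc):
        """Both locators select many labels: the result is a 2D view."""
        return "dataframe_view"

    def _get_series(self, row_loc, col_loc):
        """Exactly one locator selects many labels: the result is 1D."""
        return "series"

    def _get_scaler(self, row_loc, col_loc):
        """Both locators select a single label: the result is one value."""
        return "scalar"


indexer = IndexerSketch()
```

For example, indexer[[0, 1], slice(None)] reaches the 2D handler, while indexer[0, "a"] reaches the single-value handler.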

devin-petersohn · 2018-04-29T20:58:45Z

python/ray/dataframe/indexing.py

-            # The returned result need to be indexed series/df
-            # Re-index is needed.
-            joined_df.index = index_loc.index
+    def _get_scaler(self, row_loc, col_loc):


Same comment for this file about method-level documentation.

simon-mo · 2018-05-05T05:49:22Z

@devin-petersohn I'll make the following change in indexing.py, removing the triage redundancy:

simon-mo · 2018-05-09T08:02:13Z

Closed via #2020

simon-mo added 2 commits April 26, 2018 19:03

Implement getitem for loc and iloc

d7b6444

Resolve flake8

d4eefea

devin-petersohn reviewed Apr 29, 2018


simon-mo closed this May 9, 2018

3 participants